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We present a new Bayesian approach to model-robust linear re- 
gression that leads to uncertainty estimates with the same robustness 
properties as the Huber-White sandwich estimator. The sandwich 
estimator is known to provide asymptotically correct frequentist in- 
ference, even when standard modeling assumptions such as linearity 
and homoscedasticity in the data-generating mechanism are violated. 
Our derivation provides a compelling Bayesian justification for using 
this simple and popular tool, and it also clarifies what is being esti- 
mated when the data-generating mechanism is not linear. We demon- 
strate the applicability of our approach using a simulation study and 
health care cost data from an evaluation of the Washington State 
Basic Health Plan. 

1. Introduction. The classical theory of uncorrelated linear regression is 
based on three modeling assumptions: (i) the outcome variable is linearly 
related to the covariates on average, (ii) random variations from the lin- 
ear trend are homoscedastic, and (iii) random variations from the linear 
trend are Normally distributed. Under these assumptions, classical frequen- 
tist methods give point estimates and exact probability statements for the 
sampling distribution of these estimates. Equivalent uncertainty estimates 
are derived in the Bayesian paradigm, but are stated in terms of the pos- 
terior distribution for the unknown slope parameter in the assumed linear 
model. However, in a typical application none of these modeling assumptions 
can reasonably be expected to hold in the data-generating mechanism. 

We study the relationship between age and average annual outpatient 
health care costs using data from the evaluation of the Washington State 
Basic Health Plan. The plan provided subsidized health insurance for low 
income residents starting in 1989, and the evaluation study included 6918 
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Fig. 1. Outpatient health care costs from the evaluation of Washington States Basic 
Health Plan. Top panel: Average annual costs for 6918 subjects enrolled in the study. 
Middle panel: Semi-parametric smoothing estimates of the average annual cost vs. age, fit 
with Bayesian O 'Sullivan splines. Bottom panel: Semi-parametric smoothing estimate of 
the standard deviation of annual health care costs vs. age, fit with Bayesian O 'Sullivan 
splines. In each of the bottom two panels, the thick red line is the posterior mean of the 
spline fit, and the thin dashed red lines are example draws from the posterior distribution. 



subjects followed for an average of 22 months (range 1 to 44 months) [Diehr 
et al. (1993)]. Previous analysis of this data set has shown that the variability 
is heteroscedastic and not Normally distributed [Lumley et al. (2002)], and 
it appears from Figure 1 that the relationship deviates from linearity. We 
are still motivated to estimate a "linear trend" since this appears to be a 
dominant feature of the data, and while we could consider a transformation 
to stabilize the variance, this may not be desirable since the primary policy 
interest is in total or mean dollars not in log-dollars [Diehr et al. (1999)]. 
We also consider simulated data sets with similar features as illustrated in 
Figure 2. 

For the classical theory of linear regression to hold, the Normality as- 
sumption is only necessary if we want to derive exact sampling probabilities 
for the point estimates. In the large sample limit, the central limit theorem 
alone guarantees that the sampling distribution is asymptotically Normal 
and that the classical standard error estimates are correct. The linearity 
and homoscedasticity assumptions, however, are a different matter. If ei- 
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Fig. 2. Example scatterplots (red dots) and mean functions (black lines) from the four 
simulation scenarios considered in Section 5 with n — 400. The four scatterplots correspond 
to all possible combination of the linear and nonlinear mean functions and homoscedastic 
and heteroscedastic variance functions defined in Section 5. 

ther of these is violated in the data-generating mechanism, then classical 
standard error estimates are incorrect, even asymptotically. Furthermore, 
without the assumption of linearity, it is not immediately clear what quan- 
tity we are trying to estimate. 

A modern frequentist approach to analyzing data that do not conform 
to classical assumptions is to directly state what we want to know about 
moments of the data-generating mechanism by way of estimating equations, 
without making any assumptions about validity of an underlying model. The 
associated "robust" or "sandwich" -based standard errors provide accurate 
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large sample inference, at no more computational effort than fitting a linear 
model [Huber (1967); White (1980); Liang and Zeger (1986); Royah (1986)]. 
The Huber-White sandwich estimator is easy to implement with standard 
software and is widely used in biostatistics. As long as the data have a 
dominant linear structure, this strategy provides relevant inference for the 
linear trend and does not depend on detailed modeling of the variance or 
mean structures. 

Finding a Bayesian analogue of estimating equations and the sandwich 
estimator has been an open problem for some time. In this paper we de- 
scribe a novel Bayesian framework for linear regression that assumes neither 
linearity nor homoscedasticity. Even in the absence of a slope parameter, we 
give a natural definition for the "linear trend" quantity to be estimated and 
how to measure its uncertainty. We show that in the random covariate set- 
ting our Bayesian robust posterior standard deviations are asymptotically 
equivalent to the commonly used sandwich estimator. Furthermore, with 
fixed covariates our Bayesian robust uncertainty estimates exhibit better 
frequentist sampling properties than the sandwich estimator, when the true 
data-generating mechanism is nonlinear in the covariates. 

In Section 2 we set out our notation and define the model-robust Bayesian 
regression paradigm. In Section 3 we derive our main theoretical results for 
the case of randomly sampled covariates from a discrete space. In Section 4 
we consider extensions to continuous covariates and to a fixed design matrix. 
We demonstrate the properties of our methodology in a simulation study in 
Section 5, and in Section 6 we apply it to the annual health care cost data 
described above. We conclude in Section 7 with a discussion. 

2. Notation and definitions. 

2.1. Target of inference. We consider the familiar situation for multi- 
variate linear regression of having observed an n-vector of outcomes Y and 
an n X m matrix of covariate values X, with the stated objective of es- 
timating the "linear relationship" between X and Y. Before determining 
operationally how to do this, we take care to clarify the quantity of inter- 
est in terms of a true (but unknown) data-generating mechanism, without 
assuming that there is an underlying linear relationship. 

We assume that X represents n independent identically distributed obser- 
vations in of the m-dimensional covariate random variable x, and that 
Y represents n corresponding independent observations of the real-valued 
outcome random variable y. We think of the probability distribution for x 
as representing the frequency of different covariate values in the population 
to which we wish to generalize, and the distribution of y conditional on x 
as the distribution of the outcome for individuals with covariate values x. 
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Suppose that the true joint distribution for x and y admits a density func- 
tion A(-) for X (with respect to the Lebesgue measure on M™) such that for 
any measurable set A 



J A 

and a measurable function </>(•) on for the mean of y conditional on x 
such that 



Throughout, we use f as a dummy variable for x. 

Heuristically, we can say that we are interested in the "linear relationship" 
between x and the true conditional mean of y. If <j){-) were known to be linear, 
we would simply be interested in its coefficients. Since we are not assuming 
that the true mean function is linear, one possible approach is to define 
the quantity of interest as the m-vector of minimizing coefficients from the 
least-squares linear fit 



We can describe /3 as the set of m coefficients that minimizes the average 
squared error over the entire population in approximating the mean value 
of y by a linear function of x. 

The definition of /3 is essentially a statement about the scientific question 
of interest, and it is not concerned with the details of random sampling of 
the observations. We have identified the deterministic function (/){■) as repre- 
senting the mean dependence of y on x, and our objective is to approximate 
this curve by a straight line. We define /3 as the best linear approximation 
to the curve by the method of least-squares, an idea that dates to the 
early work of Gauss (1809), Legendre (1805) and Jacobi (1841). Our goal is 
inference for /3, not for the full function </){■)■ 

Freedman (2006) has pointedly described the dangers of fitting a linear 
model when such a model does not hold and then deriving "robust" stan- 
dard error estimates for an uninterpretable parameter. Our approach is fun- 
damentally different in that we explicitly recognize that the data-generating 
mechanism may be nonlinear, and we define /3 as a quantity of interest that 
summarizes the linear feature in the data-generating mechanism (this cor- 
responds to the standard definition of f3 if the data-generating mechanism 
is linear). While /? can be defined mathematically in a very general setting, 
consistent with the ideas in Freedman (2006), we recommend it as a relevant 
target of inference only when the data suggest a dominant linear trend. 





(2) 




(3) 
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2.2. Bayesian inference. Since we do not know the true mean function 
or the true covariate density A(-), we cannot directly calculate /3 from 
equation (3), and we need to take advantage of the observations in order 
to make inference about /3. To do this, we embed </>(•) and A(-) in a flexible 
Bayesian model in such a way that we can derive posterior distributions 
for these functions and, thus, derive a posterior distribution for /3. The key 
consideration in constructing the Bayesian model is that it be highly flexible, 
assuming neither linearity nor homoscedasticity. 

We adopt the conditionally Normal model for y, 

y\x,ct,{-),a\-)r^N{4>{x),a\x)), 

where we have introduced the ancillary unknown variance function cr^(-)- 
To complete the Bayesian model, it remains to specify a prior distribution, 
with probability measure 7r(A(-), (/>(•), which will be chosen to have a 
density that can be written 

(4) p{X{.),<P{-),a\-))=p,{X{-))p^^AH-),<^\-))- 

We will give specific examples of priors in the remainder of this paper. 

Defining priors for the discrete covariate case is relatively straightforward 
because we can specify a saturated model for the mean and variance func- 
tions <p{-) and cr^(-), and use a Dirichlet distribution for A(-). We derive 
our main theoretical results for that setting in Section 3. Later, in Section 
4, we present a simple and effective approach for extending the method to 
continuous covariates by using spline-based priors for (p{-) and 

Once we have specified priors in equation (4), standard Bayesian calculus 
gives a posterior distribution for (/>(•) and A(-), 

7T{X{-),(t>{-)\X,Y), 

and therefore a posterior distribution for the m-dimensional vector /3, 



(5) 7r(/3|X, y) = 7r( argmin / {(l){v) — va)'^ X{v) dv 



X,Y 



Following common practice, we define a point estimate by taking the poste- 
rior mean of /3, 

(6) ^j = E^{f3j\X,Y), i = l,...,m, 

and we use its posterior standard deviation as a measure of uncertainty 

(7) ,T^^=diag(Cov,(/3|X,y))f^ j = l,...,m. 

We can construct approximate moment-based 95% credible intervals with 
the formulation 

Chbj = ± l-96(T/3^. , j = l,...,m. 
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3. Discrete covariates. In this section we complete the specification of 
the Bayesian model for the discrete covariate case and derive our main the- 
oretical results in that setting. Let ^ = {£,1, ■ ■ ■ ,(,k) consist of K nonzero 
deterministic m-vectors that span M™ , and suppose that the covariate x can 
take these values. Let be the number of i = 1, . . . ,n such that Xi = ^fc, 
where Xi is the ith row of X. We let A(-) be a density with mass restricted 
to ^ C M*", written in the form 

K 

a(-) = J;a,%(-), 

k=l 

where 6^^, is the Dirac delta function with point mass at ^fc. That is, 

K 

P(x = 6;A(-)) = Afc, ^Afc = L 

k=l 

We use an improper Dirichlet prior for A(-) such that its density can be 
written 

pxm)^U^k' foif ^Afc/iV 

k=l \ k=l / 

The posterior distribution of A(-) is also Dirichlet with density 

K f ^ \ 

PA|x(A(-))ocnA,"'+"^- Oif 5]A,/1 . 

k=l \ k=l / 

One way to simulate values from the posterior is to draw independent gamma 
variates g/^ with shape parameters and unit scale parameters and then set 
Xk = dk/idi, ■ ■ ■ -.Qk) [Davison and Hinkley (1997)]. There is also a connection 
between the posterior distribution for x and bootstrap resampling [Rubin 
(1981)]. 

Since we can assume multiple samples at each covariate value, it is straight- 
forward to compute the posterior distribution for completely unstructured 
priors on the functions 0(-) and o"^(-). We introduce vector notation (^(•) = 
(</)!,..., (/)i^), ct2(-) = {al,...,a\) with 

4>k = H^k), al = a^{^k), 
and independent noninformative prior densities such that 

K 

P^,a2(0(-).O-^(-)) = n^'J^'fc.'^I^'^^'^fc) 
k=l 

and 

P^„ali(t>k,<^l)ocaf. 
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It turns out that for > 4, (pk has a posterior i-distribution with easily 
computable mean and scale parameters. Formulas based on the data X and 
Y are given in the Online Supplement [Szpiro, Rice and Lumley (2010)]. 

Our main theoretical result is contained in the following theorem, which 
is proved in the Online Supplement [Szpiro, Rice and Lumley (2010)]. It 
states that, asymptotically, the posterior mean point estimate derived in 
equation (6) is the least squares fit to the data X and Y, and that the 
posterior standard deviation from equation (7) has the sandwich form. The 
term "sandwich" refers to the algebraic formation in equation (8), where 
colloquially the {X^X)~^ terms are the "bread," and (X*SX) is the "meat." 



Theorem 1. For a discrete covariate space, assume that y conditional 
on X has bounded first and second moments. The m- dimensional estimate (3 
defined by equation (6) takes the asymptotic form 

/3- (x*x)~^x*y ^0, 

and assuming there are at least four samples for each covariate value, the 
corresponding uncertainty estimate has the asymptotic sandwich form 

(8) - diag[(X*X)-i(X*SX)(X*X)-Y^^ = o{n-^), 
where S is the diagonal matrix defined by 

(9) j.^,^i{Y,-Xi{X'X)-^X'Yf, ifi = j, 

\ 0, otherwise. 

The results hold conditionally almost surely for infinite sequences of obser- 
vations. 



4. Extensions. 



4.1. Continuous covariates. We consider extending our approach to a 
continuous covariate space. The situation is different from discrete covariates 
because we cannot expect there to be multiple realizations of each covariate 
value in the sampled set. The problem of estimating (/>(•) and o"^(-) as uncon- 
strained functions is unidentifiable. However, in applied regression settings 
it is almost always reasonable to assume that these are sufficiently regular to 
be approximated, using semi-parametric smoothing methods. This is a very 
weak assumption compared to assuming linearity and/or homoscedasticity. 
We describe a particular choice of spline prior that we implement in our 
examples, and leave the general issue of choosing optimal smoothing priors 
for future work. 

We restrict to scalar x in a model with an intercept, and approximate 
and log(7(-) with penalized O'Sullivan splines using a method based 
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on Wand and Ormerod (2008), extended to allow for lieteroscedasticity. We 
pick Q knots spread uniformly over the potential range of x and set 

Q 

(j){v;u) =ao + aiv + '^UqBg{v), 

q=l 

Q 

loga{v;w) ='yo + -fiv + ^WgBg{v), 

9=1 

where the Bq{-) are B-spline basis functions defined by the knot locations, 
with independent priors Oi ~ A^(0, 10^), 7^ ~ A^(0,10^). The specification 
of priors for u and w involves some transformations and amounts to the 
following. Define the matrix Z to incorporate an appropriate penalty term 
as in Section 4 of Wand and Ormerod (2008) and let 

Q 

4>{Xi;a) = ao + aiXi + '^aqZiq, 

q=l 

Q 

log a{Xi;b) = 70 + 71 + ^ bgZig 

9=1 

with independent priors ~ N{0,a'^) and bq ~ A^(0,0.1) and hyperparam- 
eter distributed as (c^)"^ ~ Gamma(0.1, 0.1). It is straightforward to sim- 
ulate from the posterior distributions using WinBUGS software [Lunn et 
al. (2000); Crainiceanu et al. (2005)]. For a prior on the covariate x we use 
the limiting case of a Dirichlet process that gives rise to the same posterior 
Dirichlet distribution as we had for discrete covariates [Gasparini (1995)]. 

An analogous result to Theorem 1 can be expected to hold under mild 
regularity conditions on the true mean and standard deviation functions 
(/)(•) and cr^(-) in the data-generating mechanism. We do not state such a 
result here, but we provide supporting evidence from a simulation study in 
Section 5. 

4.2. Fixed design matrix. Our development up to now explicitly treats 
X and Y as being jointly sampled from a random population. The fact 
that we obtain an equivalent estimator to the sandwich form suggests that 
the sandwich estimator also corresponds to the random X setting. This is 
easily seen from equation (9) since the variance estimate S in the "meat" 
involves residuals from a linear model and is bounded away from zero if 
the data-generating mechanism is nonlinear, even if the observations Y are 
deterministic conditional on X. 

A desirable feature of our approach is that it can easily be modified to ex- 
plicitly treat the fixed X scenario. To do this, we simply replace the random 
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density for X in equation (5) with a deterministic density corresponding to 
the actual sampled values 



1 

Afixcd(-) = -'^^Xii-), 

1=1 

where is the Dirac delta function with point mass at Xi. Then we proceed 
exactly as in Section 2.2 to define the quantity of interest 

(10) /3fixed = argmin / {(p{v) - vafX^^cdiv) dv. 

a J 

The point estimate for fixed X inference is 

(11) /3fixed = -E'tt (/5fixcd l-'^; ^) ; 

and the corresponding measure of uncertainty is 

(12) <T/3,fixcd = diag(Cov,(/3fixcd|X,>^)'/'). 

Notice that the only difference between the definitions of (3 and /Sgxcd is 
that in equation (5) the density A(-) is random while in equation (10) the 
corresponding density is a deterministic function of the fixed X values. 

For the discrete covariate setting we obtain the following result, which is 
proved in the Online Supplement [Szpiro, Rice and Lumley (2010)]. 

Theorem 2. For a discrete covariate space the m- dimensional estimate 
P&xed defined by equation (11) takes the form 

^fixed = {X'X)-^X'Y, 

and assuming there are at least four samples for each covariate value, the 
corresponding uncertainty estimate has the sandwich form 

'^/3,fixed = diag[(X*X)-i(X*Stx)(X*X)-Y/', 
where is the diagonal matrix defined by 

^ {Yl - Vkf, ifi = j and Xi = ^k, 




- 

^ij - 

and 

Vk 



Since the matrix T!^ in the "meat" only includes variation of Y around 
its mean, conditional on X, this form of the sandwich estimator appropri- 
ately describes sampling variability for fixed X, even if the data-generating 
mechanism is nonlinear. 



MODEL-ROBUST REGRESSION AND A BAYESIAN "SANDWICH" ESTIMATORl 



Table 1 

Frequentist properties of estimates for continuous covariate ( random X ) 



n = 400 n = 800 









Bias 


Width 


Coverage 


Bias 


Width 


Coverage 


Linear 


Equal 


Model based 


0.001 


0.170 


0.938 


-0.001 


0.120 


0.956 




variance 


Sandwich 


0.001 


0.170 


0.940 


-0.001 


0.120 


0.955 






Bayes robust 


0.002 


0.177 


0.943 


0.000 


0.123 


0.959 




Unequal 


Model based 


0.001 


0.445 


0.859 


-0.002 


0.314 


0.863 




variance 


Sandwich 


0.001 


0.601 


0.948 


-0.002 


0.426 


0.957 






Bayes robust 


0.002 


0.607 


0.955 


-0.001 


0.428 


0.956 


Nonlinear 


Equal 


Model based 


0.001 


0.262 


0.929 


-0.001 


0.185 


0.921 




variance 


Sandwich 


0.001 


0.298 


0.959 


-0.001 


0.211 


0.955 






Bayes robust 


0.009 


0.289 


0.950 


0.003 


0.207 


0.950 




Unequal 


Model based 


0.002 


0.487 


0.859 


-0.003 


0.345 


0.865 




variance 


Sandwich 


0.002 


0.648 


0.959 


-0.003 


0.460 


0.952 






Bayes robust 


-0.030 


0.657 


0.951 


-0.019 


0.460 


0.944 



5. Simulations. We consider examples with a single continuous covariate 
uniformly distributed in the interval [—10, 10], and we evaluate performance 
for four true distributions of y given x. These are obtained by taking com- 
binations of the linear response 

/n„(x) = 2 + 3.5x 

and the nonlinear response 

/noniin(2:) = 2 + 3.5x(l + | cos(x/2 - 2)|) 

as well as the equal variance model cr^q^^j = 5 and unequal variance model 

•^unequal = (5 + Example scatterplots of data from each of the four 

data-generating models, along with the corresponding mean response func- 
tions, are shown in Figure 2. 

For each of the four models we generate 1000 random realizations of X 
and Y with n = 400, 800. Results are given in Table 1 for inference based on 
random X. The model-based intervals (i.e., standard Bayesian or frequen- 
tist linear regression) fail to give approximate 95% coverage by being anti- 
conservative in all situations except for a linear response with equal variance. 
Our Bayesian robust intervals give approximately correct 95% coverage for 
all cases, just like the sandwich intervals. 

We repeat the simulation, treating the observed design matrix X as fixed. 
The results are shown in Table 2. As expected, for a linear data-generating 
mechanism the results are essentially the same as for random X inference. 
The model-based intervals are correct only for the equal variance case, while 
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the sandwich and Bayesian robust intervals give correct coverage for un- 
equal variance as well. If the data-generating mechanism is nonlinear and 
homoscedastic, then the model-based intervals and sandwich intervals are 
conservative since they implicitly account for random sampling of X, while 
the Bayesian robust intervals give approximately nominal 95% coverage. 
If the data-generating mechanism is both nonlinear and heteroscedastic, 
the sandwich intervals are still slightly conservative when compared to the 
Bayesian robust intervals, but both give approximately nominal 95% cover- 
age. In this situation the model-based intervals are anti-conservative. 

Overall, our Bayesian robust intervals give approximately nominal 95% 
coverage in all situations. The model-based and sandwich-based intervals can 
fail by being either conservative or anti-conservative, depending on details 
of the data-generating mechanism and the distinction between random and 
fixed X sampling. 

6. Health care cost data. We illustrate our methods using data from the 
evaluation of the Washington State Basic Health Plan, as described earlier in 
Section 1 and in more detail by Diehr et al. (1993). We use the variable "cost 
of outpatient the outcome and assess its "linear relationship" with 

age. The data are shown in the top panel of Figure 1, and O'Sullivan spline 
fits (see Section 4.1) to the mean and standard deviation as functions of age 
are shown in the bottom two panels. The thick red lines are the posterior 
means of the Bayesian spline fits, and the thin dashed red lines are example 
draws from the posterior distributions. 

Table 2 

Frequentist properties of estimates for continuous covariate (fixed X ) 

n = 400 n = 800 

Bias Width Coverage Bias Widtli Coverage 



Linear Equal 


Model based 


0.001 


variance 


Sandwich 


0.001 




Bayes robust 


0.002 


Unequal 


Model based 


0.001 


variance 


Sandwich 


0.001 




Bayes robust 


0.002 


Nonlinear Equal 


Model based 


0.001 


variance 


Sandwich 


0.001 




Bayes robust 


0.009 


Unequal 


Model based 


0.001 


variance 


Sandwich 


0.001 




Bayes robust 


-0.030 



0.170 


0.938 


-0.001 


0.120 


0.956 


0.170 


0.940 


-0.001 


0.120 


0.955 


0.173 


0.941 


0.000 


0.121 


0.954 


0.445 


0.859 


-0.002 


0.314 


0.863 


0.601 


0.948 


-0.002 


0.426 


0.957 


0.607 


0.951 


-0.001 


0.425 


0.953 


0.262 


0.986 


-0.001 


0.185 


0.998 


0.298 


0.999 


-0.001 


0.211 


1.000 


0.187 


0.959 


0.003 


0.128 


0.963 


0.487 


0.893 


-0.002 


0.345 


0.888 


0.648 


0.961 


-0.002 


0.460 


0.968 


0.629 


0.947 


-0.018 


0.437 


0.953 



MODEL-ROBUST REGRESSION AND A BAYESIAN "SANDWICH" ESTIMATORS 



Table 3 

Linear regression of average annual outpatient health care cost data from the 
evaluation of the Washington State Basic Health Plan 





Discrete X 


Continous X 














Model-based 


16.1 


1.25 


16.1 


1.25 


Sandwich 


16.1 


1.67 


16.1 


1.67 


Bayes robust (random X) 


16.1 


1.72 


15.9 


1.71 


Bayes robust (fixed X) 


16.1 


1.70 


15.9 


1.70 



We can regard the age covariate as either discrete or continuous, and to 
illustrate our methodology, we do the analysis both ways. Results are shown 
in Table 3. The difference in average annual outpatient health care costs 
associated with a one year difference in age is estimated to be 16.1 dollars, 
with a model-based standard error of 1.25 dollars. As expected, in light 
of the heteroscedasticity, the sandwich form gives a larger standard error 
estimate of 1.67 dollars. The uncertainty estimates from our Bayesian ro- 
bust estimators range from 1.70 dollars to 1.72 dollars, agreeing very closely 
with the sandwich values. The point estimate is nearly identical to the least 
squares fit (15.9 dollars) when we model age as continuous in the Bayesian 
robust approach; the slight difference is probably due to approximations in- 
volved in fitting the spline model. The Bayesian robust standard deviations 
are nearly identical when we model X as random or fixed, indicating that 
random variations in average costs conditional on age contribute more to 
the uncertainty than does nonlinearity in the data-generating mechanism. 

7. Discussion. The main contribution of this paper is a model-robust 
Bayesian framework for linear regression that gives uncertainty estimates 
equivalent to the sandwich form for random covariate sampling and with 
superior sampling properties for a fixed design matrix. In both situations, 
our estimates correctly account for heteroscedasticity and nonlinearity, in 
the sense of giving asymptotically valid frequentist sampling properties. The 
idea is to describe the data-generating mechanism nonparametrically, and 
then to define a functional of the true data-generating mechanism as the 
quantity of interest for inference. Once this quantity is defined, we follow 
common Bayesian practice and derive a point estimate as its posterior mean 
and an uncertainty estimate as its posterior standard deviation. In the case 
considered here, the quantity of interest is the least-squares linear fit to the 
(potentially nonlinear) mean of the outcome random variable y conditional 
on the covariate random variable x. 

Our conceptual framework is powerful because it provides a general def- 
inition of linear regression in a model-agnostic framework. We can move 
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seamlessly between classical model-based inference and robust sandwich- 
based inference simply by using different priors for 0(-) and cr(-). If sub- 
jective prior information is available, this can also be included without any 
modification to the methodology. Regardless of what information is encoded 
in the priors, our target of inference remains the same and has an explicit 
interpretation in terms of the trend in the data-generating mechanism. 

Our estimation approach transparently distinguishes between the cases 
where the observed covariates are regarded as random and where they are 
regarded as a fixed design matrix. We obtain good frequentist coverage prop- 
erties in both situations, and our estimates are equivalent to the sandwich 
form when the covariates are treated as random. For the fixed design matrix 
setting, our Bayesian robust intervals can provide notably better sampling 
properties than the sandwich estimator in the situation where the true data- 
generating mechanism has a mean that is nonlinear in the covariates. This 
is true asymptotically, since the sandwich estimator overestimates the stan- 
dard errors in this situation by confusing the part of the residuals that results 
from nonlinearity in the data-generating mechanism (which does not vary 
across samples and should not contribute to standard error estimates) with 
the random component in the residuals (which varies across samples and 
should be accounted for in estimating standard errors). This result can be 
seen in our simulation examples in Section 5 and by comparing the expres- 
sions for standard errors in Theorems 1 and 2. Elsewhere, we have derived 
similar results for a fixed design matrix by employing a Bayesian decision 
theoretic formalism [Rice, Lumley and Szpiro (2008)]. 

In the continuous covariate case, we use splines to approximate the mean 
and variance functions for y conditional on x. This is necessary because the 
mean and variance are not separately identifiable from a single sample at 
each covariate value. It can be regarded as a weakness in our approach, but it 
also suggests an opportunity to improve on the small-sample performance of 
the sandwich estimator by incorporating additional prior information. Our 
use of splines will work in any situation where the true mean and variance 
are smooth functions of the covariates. This smoothness is a very reasonable 
assumption for applied problems. In fact, the implicit assumption of the 
sandwich estimator that the variance function has no structure whatsoever 
seems overly permissive. By using properly calibrated splines or other semi- 
parametric priors, it should be possible to improve upon the small-sample 
performance by borrowing information from nearby covariate values. This 
approach appears particularly promising in the context of generalized esti- 
mating equations, where there may be many samples but too few clusters 
to accurately estimate a completely unstructured covariance matrix. 
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SUPPLEMENTARY MATERIAL 

Proofs of theorems in "Model robust regression and a Bayesian 'sandwich' 
estimator" (Szpiro, Rice, and Lumley) (DOL 10. 1214/10- AOAS362SUPP; 
.pdf). We provide proofs of the theorems stated in the paper "Model ro- 
bust regression and a Bayesian 'sandwich' estimator" by Adam A. Szpiro, 
Kenneth M. Rice and Thomas Lumley. 
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